SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:
Readings: TSwD (*Telling Stories with Data*, Alexander) and ROS (*Regression and Other Stories*, Gelman, Hill, and Vehtari)
Key Assumptions
The data you are analysing should map to the research question you are trying to answer: a model can only speak to what was actually measured.
Common Pitfall
A model of test scores will not necessarily tell you about child intelligence or cognitive development. A model of incomes will not necessarily tell you about total assets.
The key assumption is that the data are representative of the distribution of the outcome \(y\) given the predictors \(x_1, x_2, \ldots\)
Important Distinction
For example: in a regression of earnings on height and sex, it’s acceptable for women and tall people to be overrepresented, but problems arise if too many rich people are in the sample.
The deterministic component is a linear function of the separate predictors:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots\]
When violated: consider transforming the variables or adding interaction terms.
The simple regression model assumes that the errors from the prediction line are independent.
This assumption is violated in time-series data, clustered data (e.g. students within schools), and spatial data.
Heteroscedasticity = unequal error variance
Least Important Assumption!
The normality assumption is typically barely important at all for estimating the regression line.
It is relevant when:

- Predicting individual data points
- Constructing prediction intervals
We do not recommend routine Q-Q plots of residuals. Focus on the more important assumptions first!
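To see why normality matters mainly for prediction, a minimal sketch (simulated data; all names are illustrative) comparing a confidence interval for the line with a prediction interval for a new observation:

```r
# A minimal sketch (simulated data): normality matters for prediction
# intervals about new observations, not for the fitted line itself.
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit_demo <- lm(y ~ x)
new_point <- data.frame(x = 0)

# Uncertainty about the line at x = 0
predict(fit_demo, new_point, interval = "confidence")

# Uncertainty about a new observation at x = 0: much wider,
# and this is where the normality assumption actually matters
predict(fit_demo, new_point, interval = "prediction")
```

The prediction interval is always wider, because it adds the residual variation to the uncertainty about the line.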
Graphics are helpful for exploring the raw data, checking model fit, and communicating results.
```r
library(ggplot2)

# Simulated example
set.seed(853)
n <- 100
mom_iq <- rnorm(n, 100, 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(n, 0, 18)
df <- data.frame(mom_iq, kid_score)

# Fit model
fit <- lm(kid_score ~ mom_iq, data = df)

# Plot
ggplot(df, aes(x = mom_iq, y = kid_score)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Mother's IQ Score",
       y = "Child's Test Score") +
  theme_minimal(base_size = 14)
```

```r
# Extract coefficients and simulate
coefs <- coef(fit)
se <- summary(fit)$coefficients[, 2]

# Create plot with uncertainty bands
ggplot(df, aes(x = mom_iq, y = kid_score)) +
  geom_point(alpha = 0.5) +
  # Add several simulated lines
  geom_abline(intercept = coefs[1] + rnorm(10, 0, se[1]),
              slope = coefs[2] + rnorm(10, 0, se[2]),
              alpha = 0.2, colour = "grey50") +
  geom_smooth(method = "lm", se = FALSE,
              colour = "blue", linewidth = 1) +
  labs(x = "Mother's IQ Score",
       y = "Child's Test Score",
       title = "Regression with uncertainty") +
  theme_minimal(base_size = 14)
```

The residuals are the differences between observed and predicted values:
\[r_i = y_i - \hat{y}_i = y_i - X_i\hat{\beta}\]
Why Plot Residuals?
If the model is correct, residuals should look randomly scattered around a horizontal line at zero. This is often easier to assess than comparing data to a fitted line.
```r
# Calculate residuals and fitted values
df$fitted <- fitted(fit)
df$residuals <- residuals(fit)

# Residual plot
ggplot(df, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0,
             linetype = "dashed",
             colour = "red") +
  geom_hline(yintercept = c(-summary(fit)$sigma,
                            summary(fit)$sigma),
             linetype = "dotted",
             colour = "grey50") +
  labs(x = "Fitted Values",
       y = "Residuals",
       title = "Residual Plot") +
  theme_minimal(base_size = 14)
```

Good Signs ✓

- Random scatter around zero
- Roughly constant spread across fitted values
- About two-thirds of residuals within \(\pm\hat{\sigma}\)

Warning Signs ✗

- Curvature or other systematic patterns
- A funnel shape (heteroscedasticity)
- Extreme outliers
Key Insight
Always plot residuals against fitted values, not observed values!
Plotting residuals vs. observed values will show misleading patterns even when the model is correct.
Why? The errors \(\epsilon_i\) should be independent of the predictors \(x_i\), not the data \(y_i\).
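A quick simulated demonstration of this point (all names illustrative): with a correctly specified model, residuals are uncorrelated with fitted values by construction, yet still correlate with the observed outcomes.

```r
# Sketch: with a correctly specified model (simulated here), residuals are
# uncorrelated with the fitted values by construction, but are positively
# correlated with the observed y. A pattern against y is not evidence of misfit.
set.seed(42)
x <- rnorm(500)
y <- 1 + 2 * x + rnorm(500)
fit_demo <- lm(y ~ x)

cor(residuals(fit_demo), fitted(fit_demo))  # essentially zero
cor(residuals(fit_demo), y)                 # clearly positive
```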
When additivity and linearity are violated, transformations can help:
When to Use Log Transform
Use logarithms for outcomes that are all positive and where effects are likely multiplicative rather than additive.
A linear model on the log scale: \[\log y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \epsilon_i\]
Corresponds to a multiplicative model on the original scale: \[y_i = B_0 \cdot B_1^{x_{i1}} \cdot B_2^{x_{i2}} \cdots E_i\]
where \(B_j = e^{\beta_j}\)
For small coefficients (roughly \(|\beta| < 0.25\)):
\[e^\beta \approx 1 + \beta\]
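We can check the approximation numerically in R:

```r
# Verify the approximation for beta = 0.06
c(exp(0.06), 1 + 0.06)
#> [1] 1.061837 1.060000
```

So a coefficient of 0.06 on the log scale corresponds to roughly a 6% multiplicative change.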
```r
library(ggplot2)

# Simulated earnings data
set.seed(123)
n <- 200
height <- rnorm(n, 170, 10)
earnings <- exp(6 + 0.02 * height + rnorm(n, 0, 0.5))
df_earn <- data.frame(height, earnings)

# Compare models
fit_linear <- lm(earnings ~ height, data = df_earn)
fit_log <- lm(log(earnings) ~ height, data = df_earn)

# Plot on log scale
ggplot(df_earn, aes(x = height, y = earnings)) +
  geom_point(alpha = 0.5) +
  scale_y_log10() +
  geom_smooth(method = "lm") +
  labs(x = "Height (cm)",
       y = "Earnings (log scale)") +
  theme_minimal(base_size = 14)
```

Centering: Subtract the mean \[x_{\text{centered}} = x - \bar{x}\]
Standardising: Subtract mean, divide by standard deviation \[z = \frac{x - \bar{x}}{s_x}\]
Why Standardise by 2 SD?
Dividing by 2 standard deviations makes continuous variable coefficients comparable to binary (0/1) variable coefficients.
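A self-contained sketch of the centred fit behind the coefficients shown next, reusing the mother/child simulation from earlier in these notes:

```r
# Self-contained sketch: centring the predictor leaves the slope unchanged
# but moves the intercept to the predicted score at the average mother's IQ
# (reusing the earlier simulation with set.seed(853)).
set.seed(853)
n <- 100
mom_iq <- rnorm(n, 100, 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(n, 0, 18)
df <- data.frame(mom_iq, kid_score)
df$mom_iq_c <- df$mom_iq - mean(df$mom_iq)

fit   <- lm(kid_score ~ mom_iq, data = df)
fit_c <- lm(kid_score ~ mom_iq_c, data = df)
coef(fit)
coef(fit_c)
```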
```
# original model
(Intercept)      mom_iq 
 25.7404791   0.5862104 

# centred model
(Intercept)    mom_iq_c 
 83.2629997   0.5862104 
```
The slope is identical, but the intercept now represents the predicted score at the average mother’s IQ.
The square root transformation (`sqrt()`) can be used when a log transformation is too strong.
The Claim
In 2015, Case and Deaton published that mortality rates for middle-aged white non-Hispanic Americans increased from 1999 to 2013.
The problem: Their numbers were “not age-adjusted within the 10-year 45–54 age group.”
The issue: The composition of the 45-54 age group changed as the baby boom generation moved through.
During 1999–2013, the average age within the 45–54 group rose as the large baby-boom cohorts moved through it.
Result: Even if age-specific mortality rates were constant, the group mortality rate would increase due to compositional change.
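The arithmetic can be sketched with hypothetical numbers (these are illustrative, not the actual CDC data): age-specific rates are held constant, but the age mix within the group shifts older.

```r
# Illustrative sketch (hypothetical numbers, not Case and Deaton's data):
# age-specific mortality rates are held CONSTANT, while the age mix within
# the 45-54 group shifts older as the baby boom moves through.
rates <- c("45-49" = 0.004, "50-54" = 0.006)  # constant age-specific rates

share_1999 <- c(0.55, 0.45)  # more 45-49s early in the period
share_2013 <- c(0.45, 0.55)  # more 50-54s later

sum(rates * share_1999)  # aggregate rate in 1999
sum(rates * share_2013)  # aggregate rate in 2013: higher, purely compositional
```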
After adjusting for age composition, much of the apparent increase in mortality disappears.
Lesson Learned
Data adjustment is not merely academic. It can fundamentally change the interpretation of data and conclusions.
Tables can communicate specific values with high fidelity:
For example, with `kable()`:

| Mother's IQ | Child Score | Fitted | Residuals |
|---|---|---|---|
| 94.6 | 75.0 | 81.2 | -6.1 |
| 99.4 | 99.8 | 84.0 | 15.8 |
| 73.3 | 76.7 | 68.7 | 8.0 |
| 83.2 | 70.0 | 74.5 | -4.5 |
| 85.0 | 67.2 | 75.5 | -8.4 |
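A sketch of how such a table might be built, reusing the earlier simulation (the `kable()` call with `digits` and `col.names` is the key part; it requires the knitr package):

```r
library(knitr)

# Reusing the earlier simulation (set.seed(853)); names are from these notes
set.seed(853)
mom_iq <- rnorm(100, 100, 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(100, 0, 18)
fit <- lm(kid_score ~ mom_iq)
tab <- data.frame(mom_iq, kid_score,
                  fitted = fitted(fit),
                  residuals = residuals(fit))

kable(head(tab, 5), digits = 1,
      col.names = c("Mother's IQ", "Child Score", "Fitted", "Residuals"))
```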
Regression tables can be produced with `modelsummary()`:

|             | Linear  | Quadratic |
|---|---|---|
| (Intercept) | 25.74   | 149.03    |
|             | (10.91) | (56.45)   |
| mom_iq      | 0.59    | -1.95     |
|             | (0.11)  | (1.15)    |
| I(mom_iq^2) |         | 0.01      |
|             |         | (0.01)    |
| Num.Obs. | 100 | 100 |
| R2 | 0.224 | 0.262 |
| R2 Adj. | 0.216 | 0.247 |
| AIC | 832.7 | 829.7 |
| BIC | 840.5 | 840.2 |
| Log.Lik. | -413.362 | -410.875 |
| RMSE | 15.10 | 14.73 |
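A sketch of the `modelsummary()` call behind a table like the one above, reusing the earlier simulation (requires the modelsummary package):

```r
library(modelsummary)

# Reusing the earlier simulation; the list names become column headers
set.seed(853)
mom_iq <- rnorm(100, 100, 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(100, 0, 18)

models <- list(
  "Linear"    = lm(kid_score ~ mom_iq),
  "Quadratic" = lm(kid_score ~ mom_iq + I(mom_iq^2))
)
modelsummary(models)
```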
Keys to Good Tables

- Clear labels and units
- Sensible rounding
- Report uncertainty (e.g. standard errors) alongside estimates
Key Insight
Writing is a process of rewriting. The critical task is to get to a first draft as quickly as possible.
A quantitative paper typically includes:

Core Sections

- Title and abstract
- Introduction
- Data
- Model
- Results
- Discussion
Key Principles
An abstract should cover (in ~4-5 sentences):

- The broader context
- What this paper does
- What was found
- Why it matters
“Sense of Place”
The data section should give readers such a clear picture of the data that they feel as if they themselves were present.
Include:
Adapted from Zinsser (1976) and Alexander (2023)
\[\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-k}}\]
The proportion of variance “explained” by the model:
\[R^2 = 1 - \frac{\hat{\sigma}^2}{s_y^2}\]
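These quantities are easy to compute by hand and check against `summary()`; a sketch reusing the earlier simulation. (Strictly, the formula above with \(\hat{\sigma}^2\) and \(s_y^2\), both degrees-of-freedom adjusted, corresponds to R's adjusted \(R^2\); the unadjusted ratio \(1 - \text{RSS}/\text{TSS}\) reproduces `summary()`'s \(R^2\).)

```r
# Sketch: computing sigma-hat and R^2 by hand, checked against summary()
# (reusing the earlier simulation from these notes).
set.seed(853)
mom_iq <- rnorm(100, 100, 15)
kid_score <- 25 + 0.6 * mom_iq + rnorm(100, 0, 18)
fit <- lm(kid_score ~ mom_iq)

n <- length(kid_score)
k <- 2  # number of coefficients: intercept + slope
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - k))
r2 <- 1 - var(residuals(fit)) / var(kid_score)  # same as 1 - RSS/TSS

c(sigma_hat, summary(fit)$sigma)  # should agree
c(r2, summary(fit)$r.squared)     # should agree
# Note: 1 - sigma_hat^2 / var(kid_score) instead gives adjusted R^2
```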
Problem: Using the same data to fit and evaluate leads to optimism.
Solution: Leave-one-out (LOO) cross-validation
LOO \(R^2\) gives a more honest assessment of predictive performance.
Adding a pure-noise predictor to the simple model still nudges in-sample \(R^2\) upward:

```
   simple with_noise 
0.2243996  0.2268193 
```
\(R^2\) increased, but is the model actually better? Cross-validation would reveal the truth.
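One simple version of this check (a sketch, not the Bayesian LOO of ROS): for linear models the leave-one-out residuals have a closed form, residual divided by one minus the leverage, giving the PRESS statistic and a predictive \(R^2\). All variable names here are illustrative.

```r
# Sketch: leave-one-out R^2 via the closed-form LOO residuals for lm
# (LOO residual = residual / (1 - leverage); the sum of squares is PRESS).
set.seed(853)
x <- rnorm(100, 100, 15)
y <- 25 + 0.6 * x + rnorm(100, 0, 18)
noise <- rnorm(100)  # a predictor with no real signal

loo_r2 <- function(fit, y) {
  press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
  1 - press / sum((y - mean(y))^2)
}

loo_r2(lm(y ~ x), y)          # honest predictive R^2 (below in-sample R^2)
loo_r2(lm(y ~ x + noise), y)  # adding noise typically does not improve this
```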
| Task | Function |
|---|---|
| Fit linear model | lm() |
| Get residuals | residuals() or resid() |
| Get fitted values | fitted() |
| Model summary | summary() |
| Log transform | log() |
| Square root | sqrt() |
| Create tables | kable(), modelsummary() |
Key Takeaways

- Validity and representativeness matter more than normality
- Always plot residuals against fitted values, not observed values
- Log transformations turn additive models into multiplicative ones
- In-sample \(R^2\) is optimistic; cross-validation gives an honest check
Week 9: Logistic Regression
Readings